ABSTRACT

This review of methods used to assess the clinical
performance of medical students focuses on four common assessment
approaches: (1) the examination developed by the National Board of
Medical Examiners (NBME); (2) systematic, multifactor evaluation
methods; (3) observation techniques; and (4) problem based methods.
Analyzed in conjunction with each approach were reliability and
validity data as well as practicality of the assessment approaches.
The reliability and validity data are extensive and high for the
NBME. The NBME is the least complicated instrument to administer and
score. However, it cannot assess client-clinician interactions
utilizing live subjects. Reliability and validity data on observation
methods are sparse; and where data exist, the coefficients are
generally low. Multifaceted evaluation techniques have provided more
accurate assessments of student competence but require more time and
more people to administer multiple assessments. A final issue related
to the assessment of clinical competence involves the determination
of a generally acceptable definition of competence. (Author/BW)
Abstract

This review of methods used to assess the clinical performance
of medical students focuses on four common assessment approaches:
1. The examination developed by the National Board of Medical
Examiners (NBME), 2. Systematic, multifactor evaluation methods,
3. Observation techniques, and 4. Problem based methods. Analyzed
in conjunction with the approaches reviewed were reliability and
validity data as well as practicality of the assessment approaches.
The conclusion highlights strengths and weaknesses in commonly
used clinical assessment approaches and summarizes related
measurement issues.



ASSESSMENT OF THE CLINICAL PERFORMANCE OF MEDICAL STUDENTS:

A SURVEY OF METHODS

The traditional method for assessing medical students' clinical
competence is through oral and/or written examinations developed by
medical faculty. The oral examination questions are presented to
individual medical students by a panel of medical faculty members
at patients' bedsides or in conference room settings. The questions
are to reflect students' clinical experiences. Responses are rated
by an examining panel to determine whether students passed or failed.
Oral examinations proved to be unreliable measures of clinical
competency due to interrater scoring differences, the subjective nature
of the assessment procedures, and the problem of defining clinical
competence (Levine & McGuire, 1970). Despite the reliability
problems, the oral examination remains the major method for evaluating
clinical competence in Great Britain, Australia and Canada. The
Canadian College of Family Physicians has improved the reliability
of the oral examination in Canada by defining minimal clinical
competency standards and constructing an objective problem solving
examination (Van Wart, 1974). In many countries, the oral examination
is employed as one segment of multiple clinical assessment
modes. In the United States, the oral examination was discontinued
as a subtest of the National Board of Medical Examiners (NBME)
because of reliability problems (Hubbard, 1971).



Originally, written clinical examinations were teacher developed
and were often unreliable and invalid measures of clinical competence.
[...] objective based written standardized examination has [...]
method for assessing clinical competence, but relia[...]
[...]ive based and standard examinations have improved [...]
[...]rs: 1) examination questions reflect predetermined [...];
[...] minimal levels of competency are stated; 3) many [...]
validated (content and predictive validity studies); [...]
[...]lity of written examinations ha[...]; 5) objective test
[...]ems have eliminated most scoring problems (e.g.,
interrater differences); and 6) competency is based on the extent
that performance matches the objectives.

However, the use of written standardized tests has presented
another problem. The uniqueness of medical school clinical programs
and the richness of individual student experiences may not be tapped
by written standardized examinations. Many schools employ the written
examination in conjunction with other assessment methods to
paint a clearer picture of medical students' clinical competence.

This paper will review: 1) approaches used to define clinical
competency of medical students; 2) methods for assessing clinical
competency; and 3) outcomes of the assessment approaches. The latter
will focus on reliability, validity and practicality of the clinical
competency assessment methods.



Definitions of Clinical Competence

Rarely have explicit statements been presented of what the
undergraduate medical student is expected to perform and at what level of
proficiency. Some performance goals have been reflected in objective
based written examination items, but many studies suggest that the
correlation between the written examination grades and actual
performance is usually small and sometimes inverse (Wingard
& Williamson, 1973). Instructional designers have recommended using
behavioral objectives as means for refining the definition of clinical
competence. The use of behavioral objectives has only recently begun
to receive widespread acceptance in localized test development.

A different approach involves the analysis of tasks performed
by clinicians that are assessed through observation. The tasks are
then classified. Task analysis resulted in the development of
standards to assess students during the clinical years (Adams & Mendenhall,
1974). Two methods of developing definitions of clinical competence
emerged from this approach: 1) identification of elements leading to
satisfactory performance; and 2) measurement of performance outcomes
regarding patient care.

For classifying clinical competence, the critical incident
technique (Flanagan, 1954) has been the most widely accepted approach.
Incidents of good and bad performances are identified and classified.
Hubbard et al. (1965) identified nine major categories of clinical
competence: History, Physical Examination, Tests and Procedures,
Diagnostic Acumen, Treatment, Judgment and Skill in Implementing
Care, Continuing Care, Physician-Patient Relations and Responsibilities
as Physician. Each of the nine categories was defined as operational
tasks. With the critical incident technique, classification results
from observed outcomes of performance affecting patient care.

Since clinical competence definitions are dependent on observed
outcomes and behaviors, observing and defining the constructs and
activities to be measured is difficult and complex.
Therefore, clinical competency measures are often imprecise.

Methods for Assessing Clinical Competence

Performance assessment is generally measured by analyzing
processes to solve problems or by analyzing products or outcomes that
result from solutions. Medical student evaluation is conducted
entirely on process measures. Direct responsibility for treatment is a
necessity for performance assessment to be based on a product or
"production" mode of assessment, and medical students, who are
supervised in their care of patients, do not assume that
responsibility. Thus, the process mode is the method of evaluating
clinical competence.

The review by Wingard and Williamson (1973) indicated that little
or no correlation existed between process measures (grades) and
future performance. Thus the impetus has arisen for the development
of test procedures with predictive validity that improve the product
completing medical school.

A variety of methods exist for the assessment of clinical
competence. Four common approaches for measuring clinical competence
will be reviewed: 1) the National Board of Medical Examiners
examination; 2) systematic, multifactor evaluation methods; 3) observation
techniques; and 4) problem based methods.



National Board of Medical Examiners

The examination of the National Board of Medical Examiners
consists of three parts: 1) Part One, Preclinical Sciences (first two
years); 2) Part Two, Clinical Science (third and fourth years); and
3) Part Three, Clinical Competence (internship or residency). The
Preclinical and Clinical Sciences examinations have been established
as highly reliable measures of medical knowledge and a candidate's
ability to apply knowledge to the problem at hand (Cowles & Hubbard,
1954; Hubbard & Cowles, 1954). Part Two has yielded lower reliability
coefficients than Part One. Reasons cited for the lower reliability
of Part Two when compared with Part One are: 1) the increased
complexity of Part Two subjects when compared to Part One subjects; 2)
the variability of the methods for grading students (resulting in the
lower reliability of instructor ratings); and 3) the homogeneity of
clinical year students. Statistical studies yielded evidence that
NBME, Parts One and Two, generally: correlated more highly with
independent estimates of student proficiency by instructors,
demonstrated a reliability of measurement more adequate for precise grading,
and differentiated among the candidates due to the score distribution
(Cowles & Hubbard, 1954).

Before 1961, Part Three of the NBME was usually conducted as a
bedside, oral examination (Hubbard, Levit, et al., 1965). A case
history and physical examination were taken for a patient. Then the
M.D. was questioned by an examiner who was not familiar with the
patient. The examiner would then use the patient's chart to develop
an examination in the form of a quiz session. The procedure would
then be repeated for the same M.D. with a different patient and
examiner. Frequently, the inter-scorer reliability and evaluation
from one examination to the next was low or negative. Three variables
impacted on the bedside evaluations: the candidate, the patient and
the examiner.

Hubbard, Levit, et al. (1965) reviewed the new techniques employed
by NBME to validate the clinical competence measure, Part Three.
Clinical competence as defined was based on feedback from questionnaires
and interviews secured from interns citing incidents of clinical
performance. Nine areas were considered in defining clinical competence:
history, physical examination, tests and procedures, diagnostic acumen,
treatment, judgment and skill in implementing care, continuing care,
physician-patient relations, and responsibilities as a physician.
Subcategories, defined in behavioral terms, existed within the nine
categories. Clinical competence was determined to be best measured by
developing Part Three using motion pictures of carefully selected
patients, a section calling for the interpretation of presented clinical
data (graphic and pictorial form) and programmed testing. Thus
the patient variable was controlled by standardizing the patient
experience viewed for assessment by the students. The questions posed
were asked with objective responses as solutions to problems presented
(e.g., clinical data, patient motion pictures, etc.). Thus, the
resulting examination was a more objective measure.

Test analyses yielded Kuder-Richardson reliability coefficients
(for two equivalent forms) of 0.83 and 0.87. Since many critical
incident problems were included in the NBME, Part Three, the test was
stated to have high content validity. No predictive validity data were
available in the Hubbard, Levit, et al. study. However, correlations
of the NBME Part Three examination with NBME Part Two yielded
correlation coefficients ranging from 0.30 to 0.65. The coefficients
provided some indication that the results were fairly independent of each
other.
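
As an illustration of how a Kuder-Richardson coefficient of the kind
reported above can be computed, the sketch below implements Formula 20
(KR-20) for items scored right or wrong. The source does not say which
Kuder-Richardson formula was used, so the choice of KR-20, the use of
the population variance of total scores, and the function name kr20
are all assumptions made for this example, not details from the NBME
study.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Assumption: the population variance of total scores is used.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two examinees")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("need at least two items and equal-length rows")

    # Variance of each examinee's total score.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have no variance")

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)
```

Applied to a hypothetical matrix of examinee-by-item scores, the
function returns a single coefficient; values near 1 indicate that the
items rank examinees consistently.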

Hallock, Christenson, et al. (1977) reported data on the clinical
performances of medical students in three and four year medical
curricula in five category areas: 1) fund of knowledge; 2) medical skills;
3) problem solving; 4) professional standards; and 5) reliability of
the student in performing his/her duties. A five point grading system
was used by the faculty to rate student performances in the five areas.
The results consisted of assigned points on the five categories by
faculty members and NBME scores on the pediatrics, medicine, and
obstetrics/gynecology tests. Data analysis indicated that the four year
students' scores on the NBME correlated most highly on the fund of
knowledge category (they also had higher NBME scores). Three year students
scored higher on problem solving and professional standards. The
Hallock, Christenson, et al. study provided some evidence of the
predictive validity of the NBME, Part Three.

The NBME, Part Three, has demonstrated high reliability, high
content validity, good predictive validity and high practicality in terms
of administrative ease.

Systematic, Objective Evaluation Methods

A variety of multifaceted approaches for assessing clinical
competence have been developed. Most of the development occurred during
the late 1960's and the 1970's. The multifaceted assessment approach
attempted to measure student performance and competence in the clinical
setting. Direct observation, oral and written examinations, each adding
to the total measure of clinical competence, were common approaches for
assessing clinical competence. In many instances, the NBME was used
as one of the written examinations. The goal of the approaches
reviewed in this section was to add objectivity, variability and structure
to the assessment process so that measurements of competence would be
more reliable and valid.

Reviewed in this section are systematic, multifaceted approaches
developed by Geertsma and Chapman (1967), Graham (1971), Printen,
Chappell and Whitney (1973), O'Donohue and Wergin (1978) and Sheehan,
et al. (1980).

Geertsma and Chapman (1967) studied the system of evaluating
student performance implemented at the University of Kansas which attempted
to measure eleven dimensions: 1) fund of information; 2) comprehension;
3) problem solving; 4) reliability; 5) application; 6) judgment; 7)
originality; 8) rapport with patients; 9) poise; 10) ethical
standards; and 11) likability. Four additional dimensions were added to
aid in the preparation of recommendations and summary reports of student
progress: 1) probable success as a student; 2) probable success as a
physician; 3) acceptability as a graduate student or house officer; and
4) overall performance.

The total performance dimension could override an unsatisfactory
rating on one of the first 11 dimensions or could serve to offset
superior ratings. A three category descriptive rating scale
(unsatisfactory-superior) was employed by the instructor of each student's
major course to evaluate student performance. In some departments at
the University of Kansas, faculty met collectively to evaluate their
students. In others, faculty members handed in evaluations and a
consensus was reached in the evaluation of each student's work in
the course. The dimensions were printed on small cards which contain
spaces for narrative information. Analysis of the data indicated that
the ratings of the dimensions were highly interrelated with two
factors being identified: general cognitive factors and a noncognitive
factor centering on ethical standards. Instructors tended to give
unsatisfactory ratings on cognitive dimensions and superior ratings
on noncognitive dimensions (superior ratings are reported more
frequently than unsatisfactory ratings). Since the dimensions were
determined a priori, the authors suggested that the evaluation dimensions
be revised so as to provide operational guidelines for each dimension
derived.

Graham (1971) attempted to define behavior expected in a clinical
clerkship and developed a method of reporting such performance. The
evaluation form for clinical competence has nine sections: 1)
attainment of global objectives; 2) descriptive checklist; 3) clinical
performance checklist; 4) narrative; 5) suggestions/comments; 6) career
choice recommended; 7) degree of change; 8) other comments; and 9)
final evaluation.

The evaluation method is very time consuming (due to its detail)
and asks questions that sometimes cannot be answered due to the lack
of familiarity with students (on the part of instructors). At the
beginning of the clerkship, students evaluate themselves using the
evaluation form which was used as a part of the departmental summary.
Faculty, preceptors and staff involved with each student's program
also received copies of the evaluation forms for their comments (at
the end of the term). The forms served as the basis for evaluating
student clinical competence. The report is perused by the student
and discussed with the undergraduate coordinator, with emphasis on
weaknesses and differences of opinion as well as strengths.

Printen, Chappell and Whitney (1973) implemented a comprehensive,
objective evaluation process to assess the clinical performance of
junior medical students based on an oral examination, a written
examination, clinical performance and psychomotor skills. Their
system considered behavioral characteristics, mastery of cognitive
material, and performance of psychomotor skills, and culminated in
the development of a student profile to provide student feedback and
objective evidence of student performance and course evaluation data.
Oral examinations were held weekly in small groups (two to four
students and the instructor) and focused on the cognitive objectives.
Clinical evaluation was based on ratings by at least one faculty
member, one resident and one intern, on 10 clinical performance
variables previously rated by the surgery faculty. Significant rater
differences were investigated thoroughly by the clerkship director.
Psychomotor skills were assessed by having the student perform certain
tasks and then graded on a pass-fail basis by a resident or member of
the surgical staff. The written examination was developed around
the departmental cognitive objectives and focused on patient oriented
questions in clinical problem situations. The data were analyzed by
computer based on predetermined weights and resulted in a student
evaluation profile. The Printen, et al. method eliminated some of
the subjectivity from the evaluation of students' clinical
performance. The authors considered their greatest contributions to
evaluation to be the structuring and ordering of clinical performance
characteristics on a weighted basis (provided guidelines for assessing
effective and ineffective clinical performance).

O'Donohue and Wergin (1978) developed a proficiency assessment
process to evaluate the performance of medical students during a
clerkship in internal medicine employing preceptor evaluations of
on-the-job performance as well as independent written and oral
examinations. Preceptor evaluations consisted of ratings, on a four point
scale, using standardized evaluation forms. Every student was
evaluated by at least one preceptor who then submitted a separate
evaluation form. Written examinations were developed based on
questions submitted by faculty in each of the clinical divisions in
the department of medicine. The questions were mostly of the
multiple-choice variety. Thirty minute oral exams were given every
three months by two faculty members who had not served as preceptors
for individual students.

The examiners were trained in oral examination techniques.
They were also presented with a listing of each student's patients
and diagnoses for use by the examiner. Each examiner provided
individual scores and then jointly decided on an oral exam score.



Final grades were determined by a clerkship committee: the clinical
rating was given a weight of 66%, written and oral examinations, 17% each.
Reflected was the opinion of the committee that clinical ratings
should carry the most weight in the determination of a final grade.
The results of the study indicated that between 10 and 70 percent of
the variance in clinical ratings was due to situational variables in
the performance of individual students and rater error. Ceiling
effect was cited as a possible contributory factor to the low
interrater correlations. The oral examinations had high reliability (.754)
due to the nature of the examination (demonstration of one sample of
behavior, and lack of correlation with other measures).
Intercorrelations among the three measures of student performance were
small (the examinations appear to contribute different kinds
of data about student knowledge and competence). The study concluded
that neither the oral nor written examinations correlated highly with
performance assessment and that considerable intrastudent error
existed.
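
The weighting reported for the clerkship committee (clinical rating
66%, written and oral examinations 17% each) amounts to a simple
weighted sum. The sketch below shows that arithmetic only. The
assumption that all three components are expressed on a common 0-100
scale, and the names WEIGHTS and final_grade, are illustrative and do
not come from the O'Donohue and Wergin study.

```python
# Weights as reported in the text: clinical 66%, written 17%, oral 17%.
WEIGHTS = {"clinical": 0.66, "written": 0.17, "oral": 0.17}


def final_grade(scores):
    """Combine component scores into one weighted grade.

    scores: dict with keys "clinical", "written" and "oral".
    Assumption: every component is already on the same 0-100 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the clinical rating carries about two thirds of the weight,
unreliability in that rating passes through to the final grade far
more strongly than unreliability in either examination.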

Sheehan, et al. (1980) studied the role of moral judgment in
predicting clinical performance. Moral reasoning was assessed by
the Defining Issues Test and the Moral Judgment Interview. Clinical
performance was assessed by a scale which measured eighteen
performance characteristics covering medical knowledge, task
organization and interpersonal relations. The results indicated that moral
reasoning is a predictor of clinical performance. High moral
reasoning appears to exclude the possibility of poor performance.
The very highest level of clinical performance appears never to be
reached by those at the lowest level of moral thought. The subjects
were residents.

The systematic, objective approaches for assessing clinical
competence have in common good content validity, fairly low
inter-scorer reliability, and no predictive validity or test reliability data.
All of the studies defined a conceptualization of clinical competence
and structured their assessments based on their definition of
competence. As a result, content validity appeared to be high (Geertsma
& Chapman, 1967; Graham, 1971; Printen, et al., 1973; O'Donohue &
Wergin, 1978; and Sheehan, et al., 1980). Inter-scorer reliabilities
ranged from poor to low (Geertsma & Chapman, 1967; Printen, et al.,
1973; and O'Donohue & Wergin, 1978).

The systematic, objective assessment approaches require great
amounts of time and manpower to administer. Thus, the systematic
assessment approach is not the most practical of the four assessment
modes under review (Geertsma & Chapman, 1967; Graham, 1971; Printen,
et al., 1973). The Sheehan, et al. study (1980) provides some
evidence of predictive validity, but moral judgment is related to
performance characteristics rather than to examination measures.

Systematic, objective approaches for evaluating clinical
competence need further development before their use as reliable and
valid measures can be documented. The lack of practicality will be
an issue as long as the length of the assessment process and the number
of people involved in the process remain as stated in the studies.

Observation Methods

Direct observation by staff members of clinical student
performance is a popular evaluation method. The clinical student is
usually observed at the patient bedside performing routine tasks
such as history taking, rapport building with the patient, physical
examination and data synthesis. Reviewed herein are studies by
Hinz (1966), Hess (1969), Oaks, et al. (1969) and Turner, et al.
(1972) employing techniques such as videotaped observation and
student-preceptor bedside observation. Rating scales for
assessing clinical observation are also discussed.

Hinz (1966) described the development of a method of direct
observation of students concerned primarily with performance in
history taking and physical examination. Hinz devised a study to
examine the following: 1) to determine whether teaching is improved
by having the instructor observe at the bedside during the student's
case writing; 2) to develop more objective criteria for performance
in the case method (to establish quantitative as well as qualitative
descriptions of performance); 3) to determine whether direct
observation makes apparent aspects of student performance that are not
otherwise apparent; and 4) to determine the following effects of
direct observation on faculty and students: a) effect on the
patient-doctor relationship of having an observer at bedside; b) the
reaction of students to being observed; c) the cost to faculty in
time; and d) the impact of the faculty on student performance.
Components of the patient examination were compiled from listings
provided by a group of interns (physical examination, interview and
the organization and synthesis of data). All items were tested for
value in meeting patient examinations and for observability. A group
of internists and psychiatrists observed a group of volunteer fourth
year medical students during patient work-ups. They found that: 1)
untrained raters yield inconsistent ratings; and 2) students regarded
the experience as an opportunity for tutorial aid in history taking and
physical examination. Items were categorized according to portions of
the patient examination with particular attention on the content of
the illness and the method for securing the history. The items were
general and a comprehensive assessment applying to any case could be
developed. Fifty items were included in the rating scale with
sufficient space for notations. Raters were trained using videotaped
medical students conducting patient examinations. The pilot study
consisted of a rater sitting at bedside as a student did a
work-up of the patient. Afterwards, the student summarized his
findings and presented them to the rater and they discussed the case.
After discussing the case, the rater used the recorded observations
as the basis for reviewing the student's performance. A great deal of
interrater inconsistency was found to exist. The pilot study aided in
the development of standards of performance and in enhancing the quality
of student skills in subsequent weeks but quantitative limitations
existed (a need to weight items, etc.). The rho values of rank order
correlations of like pairs of raters ranged from .55 to .79 (like -
both from the tapes or live). Rho values were low for "unlike" raters
(.42 - .58). Thus, live and taped observations were not rated in the
same fashion. For interviews, observers recorded an overall grade
indicating whether the interview was good, fair or poor. The rho value
for the correlations between score and grade was .8[?]. Direct
observation is a potentially useful tool for qualitative evaluation of
student performances, especially when that evaluation is used for
instruction of the individual student. For quantitative purposes, direct
observation has limited use since raters differed significantly in their
view of various components of the tasks, and because adequate reliability
has not been achieved (due to the inability to structure an adequate
test of reliability).
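
The rho values cited for pairs of raters are rank order (Spearman)
correlations. As a minimal sketch of how such a value can be obtained
for two raters scoring the same students, the code below ranks each
rater's scores (giving tied scores their average rank) and then
correlates the ranks. The function names and the tie handling are
assumptions for illustration; the source does not describe the
computational procedure Hinz used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(rater_a, rater_b):
    """Spearman rank correlation between two raters' scores for the same students."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equal-length score lists with at least two students")
    ra, rb = average_ranks(rater_a), average_ranks(rater_b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ra, rb))
    var_a = sum((a - mean_a) ** 2 for a in ra)
    var_b = sum((b - mean_b) ** 2 for b in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("a rater gave every student the same score")
    return cov / (var_a * var_b) ** 0.5
```

A value near 1 means the two raters order the students almost
identically, while values in the range reported for "unlike" raters
indicate only moderate agreement on the ordering.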

Hess (1969) studied the reliability of two rating scales based on
a behavioral definition of skill in evaluating student skills in
relating to patients. Format A required the raters to classify single
units of student behavior (an uninterrupted, purposeful action by the
students) under one or more of the 11 categories. Students were
videotaped and their interviewing skills were rated. More traditional
Format B consisted of a series of statements which described various
effective and ineffective types of observable student actions. Each
student was videotaped and their performance evaluated on a 10-point
continuum. Rating scores from Form A (the interrater analysis) were
more reliable than the scores from Form B (Overall A = 0.92; Overall
B = 0.66). The importance of the design of the rating instrument in
enabling humans to function as reliable data recording instruments was
noted. A rating system which facilitated discrete judgment proved to
be more reliable than the instrument requiring fewer but more global
judgments. Hess concluded that the interaction analysis format for
assessing learning provides a much clearer picture of each student's
interview performance than did Format B and provides more clarity and
precision of measurement.

Oaks, Scheinok and Husted (1969) studied an objective rating
scale used to assess student performance in a clinical clerkship. The
scale assessed clinical clerkship performance in 11 attributes:
appearance, deportment, maturity, cooperation, scholastic ability,
student effort, interest in service, responsibility, professional
competence, interpersonal relations and chart neatness/promptness.
Students were rated by preceptors using a four-point descriptive scale
(poor-excellent).

An overall numerical rating (ranging from 65-100) was also noted
by the preceptor on the card. Objectivity was facilitated by providing
the preceptor with a three-page form listing descriptions of the 11
attributes. The descriptions served as guidelines for the overall
rating and facilitated an accurate estimate of overall clinical ability
because of the need for the preceptor to examine individual components
of the student's attitude and performance. This study measured: the
reliability of the preceptors awarding objective grades compared with
overall grades, the reliability of preceptors' ratings depending on
academic rank and the percentage of mismatched grades. The study
concluded that preceptors (almost all of them) did give a failing grade
when failure was indicated, that 42.7% of the preceptors' grades
differed from objective grades by more than 3% (but only 4% differed
from objective grades by more than 10%) and that associate/assistant
professors and instructors were more reliable in grading clinical
performance (fewer mismatches) than were residents and full professors.






Turner, et al. (1972) questioned whether clinical competence
could be evaluated by observing the performance of individuals in the
patient care situation. Student clinical performance was videotaped
and rated by specially trained pediatric residents. The researchers
wanted to know if good clinical performance could be differentiated
from poor clinical performance. Hess's (1969) method for assessing
interpersonal and communication skills was used, using two approaches
for the assessment of each variable: 1) a tally of the specific acts
which were predefined as contributing to the variables in question;
and 2) global ratings. The data indicated that variables used to
evaluate clinical performance can be better evaluated through
tabulation of specific acts as opposed to global judgments (the form in
which the variables are expressed affects reliability). Trained
raters agreed on many, but not all, physical examination procedures
performed by students. Agreement among the professionals was
important in the determination of variables that represent competence (a
priori quality standards are poor indicators).

As with the systematic, objective evaluation approaches, direct
observation has limited and generally low reliability and validity
data. Scorer reliability problems were reported (Hinz, 1966; Hess,
1969; Oaks, et al., 1969; and Turner, et al., 1972). In one study,
high predictive validity was indicated when observation scores were
compared with grades (Hinz, 1966), but no other predictive validity
was indicated in the remaining studies. Reliability coefficients
for rating scales which facilitated the formulation of discrete
judgments were higher than scales that utilized global judgments (Hess,
1969; Turner, et al., 1972).

The data in support of test construction yielded information
which supported the use of observational techniques for improving
student clinical performance in a qualitative sense.

The use of rating scales and observation for evaluating clinical
competence is a time consuming effort. Rating scales fail to capture
all important facets of a student's clinical competence and the rater
may prejudice the outcomes of an evaluation by either being too
familiar or too unfamiliar with students being observed and rated.
Observation is rarely an objective method of assessment.

Problem Based Approaches

The Problem Based Examination approach focuses on defining
events likely to be experienced by clinical students and basing
assessment of clinical competence on how the student solves
the problem. Studies conducted by Harden, et al. (1975), Newble, et
al. (1978), Harden and Gleeson (1979) and Newble, et al. (1981) are
reviewed.

Harden, Stevenson, Downie and Wilson (1975) introduced a
structured clinical examination requiring students to rotate from
one station to another in a hospital ward with various tasks assigned
at each station (e.g., station one, carry out some aspect of a
physical examination; station two, answer multiple choice questions on
the physical examination). The cueing effect that usually exists
in multiple choice examinations was minimized because the students
could not go back to check omissions in their actions, which resulted
in a fairly cue-free examination.

The structured examination setting allowed variables and
examination complexity to be controlled; aims could be more clearly
defined and more of the student's knowledge tested. Thus the
examination was more objective and a marking strategy could be decided
in advance. The examination resulted in improved feedback to
students and staff. Analysis of examination results indicated that
poor clinical performance was due to: 1) all around inadequacy; 2)
deficiency in some aspect; and 3) deficiency in specific subject
areas. A study was conducted comparing traditional clinical
examination and objective clinical observation scores with written
examination scores. The traditional scores correlated 0.17 with the
written examinations while the objective clinical evaluation scores
correlated 0.63 with the written. This method allowed for more control
over the testing situation and complexity of the material.

Newble, Elmslie and Baxter (1978) developed a patient problem
based method for assessing clinical competence in specific areas. A
listing of problems likely to be experienced by interns was derived
by a consensus process using a wide selection of clinical teachers.
A specialist was asked to develop a patient problem blueprint in
such a manner as to make it fit the scope of interns' experience.
Interns, residents, etc. reacted to the blueprint. The blueprint
was expanded to require more detailed knowledge in key areas. The
expanded problem blueprints became the basis for selecting
appropriate test methods and for the construction of test items. The
criterion was not defined in precise behavioral terms but the problem
blueprint provided precision to test construction. Open ended and
multiple choice examination questions were developed for each of the
expanded test blueprints. Students circulated among examination
stations. The examination was administered to senior and junior
medical students as well as selected residents and interns (the latter
provided criterion levels of performance). The number of participants
volunteering time to the task indicated that the new approach was
acceptable to faculty members and students. Sixty three percent of
the students felt that a mixture of multiple choice and free response
questions was appropriate for the final examination, but 84% felt
that the free response items gave a more accurate assessment of
their ability. The students rated the content test as being of high
(47%) or moderate (53%) clinical relevance. Ninety five percent of
the students indicated that the practical section contributed to a
more accurate assessment of their competence than the traditional
clinical examination. The practical section content was rated as
either highly or moderately (26%) relevant. This approach was
considered to be practical and feasible to administer.

Harden and Gleeson (1979) discussed a procedure designed to
assess clinical competence at the bedside employing the objective
structured clinical examination (OSCE). The OSCE separated
competence areas into various assessed components. Each component served
as an objective for each station in the exam. This method paralleled
that outlined by Harden, et al. (1975), but provided a detailed
method for implementing the procedure. No validity or reliability
data were provided.



Newble, Hoare and Elmslie (1981) provided validity and
reliability data for the problem based CRT of clinical competence. The
results demonstrated that the examinations have a high level of
content validity (as assessed by teaching staff and students) and showed
some evidence for construct validity. Ninety two and a half percent
of students felt the content of the test was of high or moderate
relevance. Ninety five percent rated the clinical in a similar
fashion. Satisfactory levels of internal consistency were
established for the whole test. Marker validity was satisfactory in all
test sections except those requiring examiners to rate practical
skills. Predictive and concurrent validity data could not be
accurately secured due to inconsistency in residents' and interns'
scoring. The test correlated highly with combined marks in medicine
and surgery (r = 0.62, p < 0.01) with a similar level of correlation
existing for the new examination and subsections of the final
examination (Medicine r = 0.54, Surgery r = 0.62). The new examination
written component was more highly correlated with the final
examination (r = 0.54) than the practical component (r = 0.11). Scorer
reliability for the free response section of the examination was
very high (0.95). Reliability in the stations where students were
rated ranged from 0.25 - 0.77.

Reliability data have been reported for the problem based
assessment approach (Newble, et al., 1978; and Newble, et al., 1981).
Predictive validity data ranging from 0.17 - 0.62 have been reported
for respective problem based examinations when compared with other
written examinations (Harden, et al., 1975; and Newble, et al., 1981).

The examinations involved students performing specific tasks as they
moved through a variety of stations in the clinical setting. This
practice is time consuming because of the time required to rotate
through all of the stations. Scorer reliability was low to good in
the one example cited (Newble, et al., 1981).

Conclusions

The literature reviewed provides a summative overview of the
methodologies for assessing the competency of medical student
clinicians. Of particular interest are the reliability and validity data
pertaining to the four approaches.

The reliability and validity data are extensive and high for the
NBME examination as an assessment instrument. The NBME is the least
complicated instrument to administer and score. The review could
have ended with the discussion of the NBME if the medical profession
were interested in only what is practical. Measurement problems do
exist: 1) the NBME is a standardized, norm-referenced measure; and 2)
the NBME is an external examination used to measure in a nonstandard
setting. As a standardized, norm-referenced measure, the NBME
assesses a sample of behaviors that may reflect competence. Subtle
and situational information about student competence cannot be
adequately assessed by the NBME. A major gap is the inability to assess
client-clinician interactions utilizing live subjects. Pass marks
for the NBME are low, which may indicate that the NBME is imprecise.
The pass mark for the NBME, Part III, was 290 nationally in 1981 (800
is the maximum score). The University of North Carolina Medical School
pass mark was 320 in 1980.
Another issue is the use of an external examination to assess
performance in settings with variable curricula and student populations.
It has been reported that the NBME was not as relevant a measure
of student success at a midwestern medical school as was an objective
based examination.

Reliability and validity data on observation methods are sparse.
Where the data exist, the coefficients are generally low. Most of
the reliability data pertain to scorer reliability. The literature
concluded that scorer reliability coefficients are low when
observation techniques and rating scales have been used as assessment
instruments. The recent efforts toward standardizing observation rating
scales have slightly reduced scorer inconsistency (Newble, 1976).

Multifaceted evaluation techniques have provided more accurate
assessments of student competence. One multifaceted evaluation
strategy involves utilizing a written and clinical observation measure
(equally weighted). The NBME or an objective based, teacher
constructed examination usually serves as the written measure.
Problems associated with the multifaceted approach include the length
of time and number of people required to administer multiple
assessments. The data indicate that multifaceted techniques can be
reliable and valid measures, though not the most practical measures.

A final issue related to the assessment of clinical competence
involves the determination of a generally acceptable definition of
competence which can be utilized in medical examination.
Clinical assessment definitions have focused on diverse situational
and behavioral definitions of competence: 1) clinician-patient bedside
behavior versus patient management problems; and 2) evaluation of
students based on a description of character traits versus evaluation of
objective measures of clinical performance (Newble, 1976). Definitions
of clinical competence arise in medical schools based on a consensus
of opinion. Those definitions are reflected in the assessment methods.
Multiple measures of competence will likely be preferred over solitary
measures when the issue of definition has been resolved.

The use of multiple measures increases the chances of securing 
an accurate and sensitive evaluation of clinical students. A balance 
between objective and subjective measures in one evaluative instrument 
does not exist. Computer technology has the potential for
revolutionizing the process of evaluating medical student clinical
competence by combining the objective with the subjective while
eliminating scorer inconsistency.